When establishing relationships between variables of blockgroups, not all blockgroups may exhibit the same relationships. Clustering may allow us to identify different underlying behaviors. This html file is meant to open up the tool of clustering.
Take the data and get the relevant information out of it.
#Curate socialdistancing data
sj_outside <- sj_socialdistancing %>%
rename(blockgroup = origin_census_block_group) %>%
filter(weekend == F) %>%
filter(date > as.Date(shelter_start)) %>%
group_by(blockgroup) %>%
summarize(completely_home_device_count = sum(completely_home_device_count), device_count = sum(device_count)) %>%
mutate('perc_away'= 100 - (completely_home_device_count/device_count*100) %>% round(1))
# load in income data - code adapted from other students
sj_median_income_bg <-
getCensus(
name = "acs/acs5",
vintage = 2018,
region = "block group:*",
regionin = "state:06+county:085",
vars = "B19013_001E"
) %>%
mutate(
blockgroup =
paste0(state,county,tract,block_group)
) %>%
rename(
median_income = B19013_001E
)
# get data on vehicles available as vehicles allocation
sj_vehicles_bg <- getCensus(
name = "acs/acs5",
vintage = 2018,
region = "block group:*",
regionin = "state:06+county:085",
vars = "B992512_001E"
) %>%
mutate(
blockgroup =
paste0(state,county,tract,block_group)) %>%
rename(
total_vehicles = B992512_001E
)
This is how clustering works: if your data entries have three types of data, then imagine that you plot the data in a 3 dimensional space, with each dimension representing a different type of data. In our case, those three data are, percentage time spent outside, median income, and number of vehicles. The clustering algorithm then groups points according to spatial proximity.
For this proximity to work correctly, the data needs to be normalized. Since percentage of people away from home are always beteen 0 and 1, while number of vehicles may be in the 100,000s, if we don’t normalize, then the percentage of people away from home may dominate how clusters are formed. So we divide each data feature by its standard deviation.
sj_data puts all the data together.
norm_sj_data has all the normalized data.
sj_data <- sj_outside %>%
select(c('blockgroup','perc_away')) %>%
inner_join(sj_median_income_bg %>% select(c('median_income', 'blockgroup'))) %>%
inner_join(sj_vehicles_bg %>% select(c('total_vehicles', 'blockgroup'))) %>%
filter(median_income > 0 & total_vehicles > 0)
norm_sj_data <- sj_data %>%
mutate(median_income = median_income/sd(median_income),
total_vehicles = total_vehicles/sd(total_vehicles),
perc_away = perc_away/sd(perc_away),
blockgroup = NULL) #The blockgroup needs to be removed, else it affects clustering.
plot(norm_sj_data$median_income, norm_sj_data$perc_away)
In the plot above, observe the small negative relationship between income and percent of people away from home.
We are going to use the classical kmeans algorithm for clustering. One key input for our clustering algorithm is the number of clusters. This is not information we can know beforehand. So we just cluster for many different numbers and measure the sum of the ‘within sum of squares’ (withinss). withinss measures the sum of the pairwise distances within each pair, and naturally, good clustering should reduce the withinss values for all clusters.
So we do clustering for number of clusters going from 1 to 15, sum up the withinss values for each cluster for a given number of clusters, and see the point where there is maximum drop in the total withinss value.
wss <- kmeans(norm_sj_data,centers=1)$tot.withinss
# We take iteration 2 to 15
for (i in 2:15) wss[i] <- kmeans(norm_sj_data,centers=i)$tot.withinss
# We plot the 15 withinss values. One for each k
plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
It seems that 5 is a good candidate. Although, generally you can’t go wrong with having one less cluster.
Now, we do the clustering with centers = 5. We use the factoextra package to visualize. In this visualization, we use the original sj_data, rather than the normalized version.
clusters <- kmeans(norm_sj_data,centers=5)
fviz_cluster(clusters, data = sj_data %>% select(-blockgroup))
Let us revisit the income to social distancing relationship within each cluster. In particular, the relationship seems to have strongly reversed for cluster 2.
sj_data$cluster <- clusters$cluster
sj_data_cluster1 <- sj_data %>% filter(cluster == 1)
sj_data_cluster2 <- sj_data %>% filter(cluster == 2)
sj_data_cluster3 <- sj_data %>% filter(cluster == 3)
sj_data_cluster4 <- sj_data %>% filter(cluster == 4)
sj_data_cluster5 <- sj_data %>% filter(cluster == 5)
plot(sj_data_cluster1$median_income, sj_data_cluster1$perc_away)
plot(sj_data_cluster2$median_income, sj_data_cluster2$perc_away)
plot(sj_data_cluster3$median_income, sj_data_cluster3$perc_away)
plot(sj_data_cluster4$median_income, sj_data_cluster4$perc_away)
plot(sj_data_cluster5$median_income, sj_data_cluster5$perc_away)
We can plot where these clusters are.
sj_blockgroups_clusters <- sj_blockgroups %>%
inner_join(sj_data[,c('blockgroup', 'cluster')], by = c('GEOID' = 'blockgroup'))
mapview(sj_blockgroups_clusters, zcol = 'cluster')